Gradient Descent is used to train some of the most advanced deep learning models. The goal of Gradient Descent:
Gradient Descent is an algorithm used to minimize any function, not just a cost function. Usually, you start with some values of $w$ and $b$ (e.g. set $w=0$, $b=0$), then keep changing $w$ and $b$ to reduce $J(w, b)$ until you settle at or near a minimum (there may be multiple minima).
The first step is choosing the direction of your first step. The goal is to take the steepest step toward the minimum.
Your starting values will drastically influence your ending point. This is our gradient descent algorithm:
$$w = w - \alpha \frac{\partial}{\partial w} J(w, b)$$
$$b = b - \alpha \frac{\partial}{\partial b} J(w, b)$$
This is an assignment of $w$, not an assertion! $\alpha$ is the [[Learning Rate]]. We take the derivative of the cost function in order to decide the direction. Notice that we do this for $w$, and we must do the same for $b$.
We repeat this until convergence, updating the two parameters simultaneously.
We need to make sure the values of $w$ and $b$ do not change while we compute their new values. Otherwise, gradient descent would not function as intended: updating $w$ first would alter the gradient used in the following update of $b$ (and vice versa).
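A minimal sketch of the difference, using a single training example; the helper names `simultaneous_update` and `sequential_update` are my own, not from the course:

```python
def simultaneous_update(w, b, alpha, x, y):
    """Correct: compute BOTH gradients from the old w and b, then assign."""
    f_wb = w * x + b                      # prediction using the old parameters
    tmp_w = w - alpha * (f_wb - y) * x    # uses old w and b
    tmp_b = b - alpha * (f_wb - y)        # also uses old w and b
    return tmp_w, tmp_b

def sequential_update(w, b, alpha, x, y):
    """Incorrect: w is updated first, so b's gradient sees the NEW w."""
    w = w - alpha * (w * x + b - y) * x
    b = b - alpha * (w * x + b - y)       # <- already uses the updated w
    return w, b
```

With $w=b=1$, $\alpha=0.1$, and the example $(x, y) = (2, 0)$, the two versions land on different values of $b$, which is exactly the unintended alteration described above.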
The squared-error cost function settles at a single global minimum; this is due to its convex shape, looking like a bowl. ![[Pasted image 20240630152844.png |500]]
This is Batch Gradient Descent, meaning that every step uses all $m$ training examples.
## Do both simultaneously
```python
def gradient_function(x, y, w, b):
    """Compute dJ/dw and dJ/db, averaged over all m training examples."""
    m = len(x)
    dw_term = 0
    db_term = 0
    for i in range(m):
        f_wb = w * x[i] + b
        dw_term += (f_wb - y[i]) * x[i]
        db_term += (f_wb - y[i])
    return dw_term / m, db_term / m

def cost_function(x, y, w, b):
    """Squared-error cost: J(w, b) = (1 / 2m) * sum of (f_wb - y)^2."""
    m = len(x)
    cost = 0
    for i in range(m):
        f_wb = w * x[i] + b
        cost += (f_wb - y[i]) ** 2
    return cost / (2 * m)

def gradient_descent(x, y, w_initial, b_initial, alpha, iters):
    w = w_initial
    b = b_initial
    for i in range(iters):
        dw_term, db_term = gradient_function(x, y, w, b)  # find the gradients
        b = b - alpha * db_term  # step the b param
        w = w - alpha * dw_term  # step the w param
    return w, b
```
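The loops above can also be vectorized with NumPy; a minimal sketch (the name `gradient_descent_vec` is my own):

```python
import numpy as np

def gradient_descent_vec(x, y, w, b, alpha, iters):
    """Batch gradient descent for f(x) = w*x + b, vectorized over all m examples."""
    m = x.shape[0]
    for _ in range(iters):
        f_wb = w * x + b                        # predictions for all examples at once
        dw = ((f_wb - y) * x).sum() / m         # dJ/dw
        db = (f_wb - y).sum() / m               # dJ/db
        w, b = w - alpha * dw, b - alpha * db   # simultaneous update
    return w, b
```

On toy data generated from $y = 2x + 1$, this recovers $w \approx 2$, $b \approx 1$ after a few thousand steps with a small learning rate.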
We can always visualize the cost vs. iterations graph, but there is another way: we can use an [[Automatic convergence test]].
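One common way such a test is sketched: declare convergence once a step decreases the cost by less than some small threshold $\epsilon$ (the helper name `converged` and the `1e-3` default are my own assumptions, not fixed values):

```python
def converged(cost_history, epsilon=1e-3):
    """Declare convergence when the last step decreased the cost by less than epsilon."""
    if len(cost_history) < 2:
        return False
    return cost_history[-2] - cost_history[-1] < epsilon
```

In practice $\epsilon$ is hard to choose well, which is why looking at the cost-vs-iterations graph is often preferred.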
Here is [[Cost Function#the implementation of this cost function|The implementation of this cost function]]
We want to find $\vec{w}$ and $b$ such that, given a new $\vec{x}$, we output $f_{\vec{w},b}(\vec{x})$; we can then make a prediction/estimate the probability.
We want to minimize $J(\vec{w}, b)$, so we can just do the normal gradient descent formula. Since
$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
and
$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$$
we technically end up with a similar-looking formula to linear regression.
It is still very different, though, due to the different function $f_{\vec{w},b}$ within the summation. We can also always perform [[Feature scaling]] to speed up this process of [[Gradient Descent]].
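A sketch of one gradient step with the logistic model plugged in, for a single feature: only the line computing `f_wb` differs from the linear-regression version (the helper name `logistic_gradient_step` is my own):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def logistic_gradient_step(x, y, w, b, alpha):
    """One gradient-descent step for logistic regression with a single feature."""
    m = len(x)
    dw = 0.0
    db = 0.0
    for i in range(m):
        f_wb = sigmoid(w * x[i] + b)  # <- the only change vs. linear regression
        dw += (f_wb - y[i]) * x[i]
        db += (f_wb - y[i])
    return w - alpha * dw / m, b - alpha * db / m
```

Repeating this step on a small labeled set (negatives at negative $x$, positives at positive $x$) drives $w$ positive, so the model outputs probability above 0.5 for positive inputs.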